---
title: "Issue Sampling Plan"
author: Scott Clifford, Thomas J. Leeper, and Carlisle Rainey
output:
  html_document:
    code_folding: hide
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
---

```{r load_data}
# disproportionate sampling of questions
# setwd("c:/users/thomas/dropbox/methods/cueeffectcomparisons/roper coding")
# rmarkdown::render("sampling.Rmd", quiet = TRUE)
options(width=100)

# load packages
requireNamespace("rio", quietly = TRUE)

# load dataset of coded issues
issues <- rio::import("Roper 2016 Full_v7.xlsx", which = "Roper 2016 Full")
issues <- issues[issues$Partisanship != "N", , drop = FALSE]
```

The dataset contains codings for public opinion questions, which we have coded into the following hierarchical scheme:

 - Category: social, economic, or foreign policy
 - Issue: a more specific breakdown of issue
 - Policy: the specific policy being asked about in the question

Because our interest is in party cue effects, we want to study issues on which support is split between the parties. We therefore exclude from the complete dataset any issue for which there is no partisan difference in support.

# Data

There are `r nrow(issues)` total questions, but these data contain `r sum(duplicated(issues[c("Issue", "Policy")]))` policy duplicates (questions that repeat the issue and policy of an earlier question). Excluding duplicates, we obtain a dataset with `r nrow(issues[!duplicated(issues[c("Issue", "Policy")]), , drop = FALSE])` total policies. These are broken down by issue category as follows:

```{r issue_categories}
knitr::kable(cbind(names(table(issues$Category)),
                   apply(with(issues, table(Category, Issue)), 1L, function(x) sum(x != 0)),
                   table(issues$Category),
                   table(issues[!duplicated(issues[c("Issue", "Policy")]), "Category"])
                  ),
             row.names = FALSE,
             col.names = c("Category", "Issues", "Policies (w/ Duplicates)", "Policies (w/o Duplicates)"))
```
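
As a minimal illustration of how the duplicate counts above are obtained, the sketch below applies `duplicated()` to the `Issue` and `Policy` columns of a toy data frame. The rows are hypothetical and are not taken from the Roper file.

```r
# Toy data (hypothetical values, not from the Roper file)
toy <- data.frame(
  Issue  = c("guns", "guns", "energy", "energy"),
  Policy = c("background checks", "background checks", "carbon tax", "drilling"),
  stringsAsFactors = FALSE
)

# A row is flagged only if BOTH Issue and Policy repeat an earlier row
is_dup <- duplicated(toy[c("Issue", "Policy")])

# number of duplicated policies
n_duplicates <- sum(is_dup)

# de-duplicated data: the first occurrence of each Issue/Policy pair is kept
toy_unique <- toy[!is_dup, , drop = FALSE]
```

Because `duplicated()` flags only the second and later occurrences, the first question asked about each policy is the one retained.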

A complexity in the sampling is that the number of policies per issue is not uniform across issue categories but instead varies considerably. Across the three categories, the number of specific policies per issue is as follows (with duplicates excluded):

```{r policies_per_issue, results="asis"}
# divide by category
issues_subset <- issues[!duplicated(issues[c("Issue", "Policy")]), , drop = FALSE]
issues_list <- list(
  economic = issues_subset[issues_subset$Category == "economic",],
  'foreign policy' = issues_subset[issues_subset$Category == "foreign policy",],
  social = issues_subset[issues_subset$Category == "social",]
)
knitr::kable(with(data.frame(table(issues_subset$Issue, issues_subset$Category)), table(Var2, Freq))[,-1L])
```

The breakdown of these categories by more specific issue is as follows:

```{r issues}
issue_name_vec <- names(table(issues_subset$Issue))
tmp <- cbind.data.frame(
  issue = issue_name_vec,
  n1 = as.data.frame(table(issues[["Issue"]])[issue_name_vec])[[2L]],
  n2 = as.data.frame(table(issues_subset[["Issue"]])[issue_name_vec])[[2L]]
)
knitr::kable(tmp[order(tmp$n2, tmp$issue, decreasing = TRUE),],
             row.names = FALSE,
             col.names = c("Issue", "Policies (w/ Duplicates)", "Policies (w/o Duplicates)"))
rm(tmp)
```

For example, there are 24 specific gun control policies, 19 energy policies, and 18 national defense policies in the data. By contrast, there is only one specific policy each on the topics of sports betting and sex education. This requires a stratified sampling strategy that is disproportionate with respect to category and issue.

```{r drop_duplicates}
# drop duplicated policies
issues <- issues_subset
issues_list <- list(
  economic = issues[issues$Category == "economic",],
  'foreign policy' = issues[issues$Category == "foreign policy",],
  social = issues[issues$Category == "social",]
)
```

# Sampling

We want to ask each respondent about several policies, and we want to sample those policies from the complete issue space so that we have measures of support for a modest number of total policies spread over all categories (social, economic, and foreign) and across multiple broad issue areas within these categories. This requires some care. Rather than sampling policies, we could include all policies in the design, but this would generate a sparse dataset with relatively few responses per specific policy. Instead, our goal is to obtain a sample of about 50 policies, of which each respondent will receive a further random sample of 5 specific policy questions.
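
To make the sparsity trade-off concrete, the following sketch simulates the respondent-level draw. The number of respondents (`n_respondents`) and the seed are assumptions chosen for illustration; this plan does not specify them, and the survey platform may implement the draw differently.

```r
# Hypothetical inputs: 48 sampled policies, 1,000 respondents, 5 questions each
n_policies <- 48L
n_respondents <- 1000L
per_respondent <- 5L

set.seed(1)  # arbitrary seed, used only to make this illustration reproducible

# each column holds the policy indices drawn without replacement for one respondent
assignments <- replicate(n_respondents, sample.int(n_policies, per_respondent))

# expected number of responses per policy under this design
expected_per_policy <- n_respondents * per_respondent / n_policies

# realized number of responses per policy in this simulated draw
observed_per_policy <- tabulate(as.vector(assignments), nbins = n_policies)
```

Under these assumed numbers each policy receives a little over 100 responses on average. Including every policy in the design would divide the same 5,000 responses over a much larger set.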

Excluding duplicated policies, we can sample from the `r nrow(issues)` unique policies as follows. In the de-duplicated data, there are `r nrow(issues[issues$Category == "economic",])` economic policies, `r nrow(issues[issues$Category == "foreign policy",])` foreign policies, and `r nrow(issues[issues$Category == "social",])` social policies spread across `r length(unique(issues[issues$Category == "economic", "Issue"]))` economic policy issues, `r length(unique(issues[issues$Category == "foreign policy", "Issue"]))` foreign policy issues, and `r length(unique(issues[issues$Category == "social", "Issue"]))` social policy issues.

To deal with this complexity, we use a two-stage sampling procedure, sampling proportionately across categories but disproportionately across issues within categories. For each category we set a threshold, the target number of policies to sample from that category, based on the total number of policies in the category. These thresholds are 16, 8, and 24 for economic, foreign, and social policies, respectively. We sample policies from each category until its threshold is reached, leaving us with a total of 48 policies.

```{r sampling_plan}
knitr::kable(data.frame(
  # category names
  Category = names(issues_list),
  # number of issues per category
  Issues = unlist(lapply(issues_list, function(x) length(unique(x$Issue)))),
  # number of policies per category
  Policies = unlist(lapply(issues_list, nrow)),
  # number of policies to sample from category
  Thresholds = c(16, 8, 24),
  row.names = 1:3
), row.names = FALSE)
```
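
The text does not give the exact calculation behind the thresholds. One possible way to allocate a target total in proportion to category size is sketched below. The category counts in `n_by_category` are hypothetical placeholders chosen so that the arithmetic is exact; they are not the counts in the table above.

```r
# Hypothetical category sizes (placeholders, NOT the real counts)
n_by_category <- c(economic = 60, "foreign policy" = 30, social = 90)

# total number of policies we aim to sample
target_total <- 48

# allocate the target in proportion to each category's share of all policies
thresholds <- round(target_total * n_by_category / sum(n_by_category))
```

With real counts, rounding can make the allocations sum to slightly more or less than the target, so a manual adjustment may be needed.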

To address the uneven number of policies per issue, we sample issues without replacement from within each category. If the sampled issue has only one policy, we sample it, reducing by one the number of policies still needed to reach the threshold. Likewise, if the issue has two policies, we sample one of them. If the issue has more than two policies, we sample three policies from it, or fewer if fewer are needed to reach the threshold. (For example, if the threshold is 8, we have already sampled 6 policies, and we sample an issue with 5 policies, we will sample only two policies from the issue.) We repeat this process, sampling issues without replacement, until the threshold is reached, and we repeat the entire process separately for each category.
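
The per-issue rule can be summarized in a small helper. The function `n_to_take()` is hypothetical and written only to illustrate the rule described above; it is not part of the sampling code in this report.

```r
# Hypothetical helper: number of policies to take from a sampled issue,
# given the issue's size and the number of policies still needed in the category
n_to_take <- function(n_in_issue, remaining) {
  if (n_in_issue <= 2L) {
    wanted <- 1L  # issues with one or two policies contribute one policy
  } else {
    wanted <- 3L  # larger issues contribute up to three policies
  }
  # never take more than the threshold still allows
  min(wanted, remaining)
}

# the example from the text: threshold 8, 6 already sampled, issue with 5 policies
n_to_take(n_in_issue = 5L, remaining = 8L - 6L)
```

The final call returns 2, matching the worked example in the text.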

```{r sampling}
# define function to do the sampling
do_sampling <-
function(
  x, # data frame of policies
  category, # category to sample from
  threshold
) {
    # subset x to the requested category
    this_category <- x[x$Category == category, c("QuestionID", "QuestionTxt", "Category", "Issue", "Policy", "Partisanship"), drop = FALSE]
    
    # define list to store sampled policies
    sampled <- list()
    
    # define integer counting number of sampled policies
    n_sampled <- 0L
    
    # while `n_sampled` < threshold, sample one issue w/o replacement, then sample policies from it
    while ((nrow(this_category) >= 1L) && (n_sampled < threshold)) {
        
        # sample an issue from this category
        sampled_issue_name <- sample(unique(this_category$Issue), 1L)
        sampled_issue <- this_category[this_category$Issue %in% sampled_issue_name, , drop = FALSE]
        
        ## drop the issue from category (thus: sampling w/o replacement)
        this_category <- this_category[!this_category$Issue %in% sampled_issue_name, , drop = FALSE]
        
        # sample policies from the issue based upon number of policies in issue
        if (nrow(sampled_issue) == 1L) {
            # issue has a single policy: take it
            sampled[[length(sampled) + 1L]] <- sampled_issue
            
            # increment `n_sampled`
            n_sampled <- n_sampled + 1L
            
        } else if (nrow(sampled_issue) == 2L) {
            # issue has two policies: sample one of them at random
            sampled[[length(sampled) + 1L]] <- sampled_issue[sample(seq_len(nrow(sampled_issue)), 1L), , drop = FALSE]
            
            # increment `n_sampled`
            n_sampled <- n_sampled + 1L
            
        } else {
            # sample up to three policies, without exceeding the threshold
            n_take <- min(3L, threshold - n_sampled)
            sample_indices <- sample(seq_len(nrow(sampled_issue)), n_take, FALSE)
            sampled[[length(sampled) + 1L]] <- sampled_issue[sample_indices, , drop = FALSE]
            
            # increment `n_sampled` by the number of policies taken
            n_sampled <- n_sampled + n_take
        }
    }
    
    # return data frame
    do.call("rbind.data.frame", sampled)
}

# do sampling
set.seed(20180720)
sampled_policies <- list(
  economic = do_sampling(issues, "economic", threshold = 16L),
  foreign = do_sampling(issues, "foreign policy", threshold = 8L),
  social = do_sampling(issues, "social", threshold = 24L)
)
issue_sample <- do.call("rbind.data.frame", sampled_policies)
issue_sample <- issue_sample[order(issue_sample$Category, issue_sample$Issue, issue_sample$Policy), ]
rownames(issue_sample) <- seq_len(nrow(issue_sample))
```

# Final Sample

Our final sample of `r nrow(issue_sample)` policies - consisting of `r nrow(sampled_policies$economic)` economic, `r nrow(sampled_policies$foreign)` foreign, and `r nrow(sampled_policies$social)` social policies - is as follows:

```{r sample}
knitr::kable(issue_sample[c("Category", "Issue", "Policy", "Partisanship")])
```

The breakdown of partisanship leanings (more favored by Democrats versus more favored by Republicans) is as follows:

```{r sample_partisanship}
knitr::kable(
  cbind(
    as.character(data.frame(table(issues$Partisanship))[[1L]]),
    data.frame(table(issues$Partisanship))[[2L]],
    data.frame(table(issue_sample$Partisanship))[[2L]]
  ),
  row.names = FALSE,
  col.names = c("Partisanship", "All Policies", "Sampled Policies")
)
```


```{r export}
## flag sampled issues in original data
issues[["Included"]] <- 0L
issues[["Included"]][issues[["QuestionID"]] %in% issue_sample[["QuestionID"]]] <- 1L

# export
if (!file.exists("Roper 2016 Sample.xlsx")) {
    rio::export(issue_sample, "Roper 2016 Sample.xlsx")
} else {
    stop("'Roper 2016 Sample.xlsx' already exists!")
}
```


# Appendix

This report was generated in the following environment:

```{r session_info}
sessionInfo()
```
